Data Visualization Techniques

Venustiano Soancatl Aguilar

Content

  • The grammar of graphics
  • The major components of layers
  • Hands-on practice
  • Visualizations based on the gg approach

The grammar of graphics


The grammar of graphics is about grammatical rules for creating perceivable graphs, or what we call graphics (Leland Wilkinson, 2005).


By analogy, good grammar is just the first step in creating a good sentence; likewise, a grammatically valid graphic is not automatically a good one.

An Object-Oriented Graphics System

  1. Specification
    1. DATA : a set of data operations that create variables from datasets,
    2. TRANS : variable transformations (e.g., rank),
    3. SCALE : scale transformations (e.g., log),
    4. COORD : a coordinate system (e.g., polar),
    5. ELEMENT : graphs (e.g., points) and their aesthetic attributes (e.g., color),
    6. GUIDE : one or more guides (axes, legends, etc.).
  2. Assembly
  3. Display
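
The difference between TRANS and SCALE in the specification above is easy to miss: a variable transformation produces a new variable, whereas a scale transformation changes how values are placed along an axis. The following is a minimal sketch in plain Python; the `rank` helper and the three sample values are made up for illustration and are not part of any plotting library.

```python
import math

values = [10.0, 1000.0, 100.0]  # illustrative values, not from a real dataset


def rank(xs):
    """Rank values from 1 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# TRANS: derive a new variable (the rank) from the original one.
ranked = rank(values)

# SCALE: keep the variable, but compute its position on a log10 axis.
log_scaled = [math.log10(v) for v in values]

print(ranked)
print(log_scaled)
```

Both results could be plotted, but only the first changes the variable itself; the second changes the axis on which the unchanged variable is displayed.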

Graphics Pipeline

  • Algebra comprises the operations that allow us to combine variables and specify dimensions of graphs.
  • Scales involve the representation of variables on measured dimensions.
  • Statistics covers the functions that allow graphs to change their appearance and representation schemes.
  • Geometry covers the creation of geometric graphs from variables.

A layered grammar of graphics

Layers of the grammar of graphics


A layer is composed of

  1. data and aesthetic mappings
  2. a geometric object
  3. a statistical transformation
  4. a position adjustment
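
To make the four components concrete, a layer can be thought of as a small record with one field per component. The sketch below is purely illustrative: `Layer`, `Plot`, and `add` are hypothetical names, not part of ggplot2 or seaborn, and the default values are assumptions chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One layer of a plot (hypothetical structure for illustration)."""
    data: list                   # 1. data: rows stored as dicts
    mapping: dict                # 1. aesthetic mappings: aesthetic -> variable name
    geom: str = "point"          # 2. geometric object
    stat: str = "identity"       # 3. statistical transformation
    position: str = "identity"   # 4. position adjustment


@dataclass
class Plot:
    """A plot is an ordered collection of layers."""
    layers: list = field(default_factory=list)

    def add(self, layer: Layer) -> "Plot":
        self.layers.append(layer)
        return self


# Two rows taken from the daily temperature table shown later in this document.
rows = [
    {"date": "0-01-01", "temperature": 51.0, "location": "Death Valley"},
    {"date": "0-01-02", "temperature": 51.2, "location": "Death Valley"},
]
mapping = {"x": "date", "y": "temperature", "color": "location"}

# A line layer and a point layer that share the same data and mappings.
plot = Plot().add(Layer(rows, mapping, geom="line")).add(Layer(rows, mapping))
print([(layer.geom, layer.stat, layer.position) for layer in plot.layers])
```

Keeping the four components as separate fields is what allows one of them (for example the geometry) to be swapped while the others stay fixed.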

1. Data and aesthetic mapping

Aesthetic mappings

2. Geometric objects

Geometric object

A sample of geometric objects

Graphical primitives

  • geom_path()
  • geom_rect()
  • geom_polygon()

One variable

  • Discrete
    • geom_bar()
  • Continuous
    • geom_histogram()
    • geom_density()

Two variables

  • Both continuous
    • geom_smooth()
    • geom_point()
  • At least one discrete
    • geom_count()
    • geom_jitter()
  • One continuous, one discrete
    • geom_boxplot()
    • geom_violin()

Three variables

  • geom_contour()
  • geom_tile()
  • geom_raster()

Aesthetics mapping in practice

library(dviz.supp)
library(forcats)
library(lubridate)

if (!requireNamespace("gt", quietly = TRUE)) install.packages("gt")
library(gt)

Daily temperature data
station_id   month  day  temperature  flag  date     location
USC00042319  01     1    51.0         S     0-01-01  Death Valley
USC00042319  01     2    51.2         S     0-01-02  Death Valley
USC00042319  01     3    51.3         S     0-01-03  Death Valley
USC00042319  01     4    51.4         S     0-01-04  Death Valley
USC00042319  01     5    51.6         S     0-01-05  Death Valley
USC00042319  01     6    51.7         S     0-01-06  Death Valley

Mapping and geometry

p <- ggplot(temps_long, 
            aes(x = date, 
                y = temperature, 
                color = location)
            ) +
  geom_line(linewidth = 1) +
  scale_x_date(name = "month", 
               limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
               breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
                          ymd("0000-10-01"), ymd("0001-01-01")),
               labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) + 
  scale_y_continuous(limits = c(19.9, 107),
                     breaks = seq(20, 100, by = 20),
                     name = "temperature (°F)") +
  scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
  theme_dviz_grid() +
  theme(legend.title.align = 0.5)

Temperature plot

Seaborn and the Grammar of Graphics

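# Imports needed by this excerpt. `lf` (the temperature data) and
# `palette_map` (location -> color) are defined earlier and are not shown here.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Create plot
fig, ax = plt.subplots(figsize=(9, 5))

# Use seaborn lineplot; pass palette by mapping
sns.lineplot(
    data=lf,
    x='date',
    y='temperature',
    hue='location',
    palette=palette_map,
    linewidth=1.5,  # similar to geom_line linewidth
    ax=ax
)

# X-axis limits and breaks (use valid years 2000-01-01 to 2001-01-04)
xmin = pd.to_datetime("2000-01-01")
xmax = pd.to_datetime("2001-01-04")
ax.set_xlim(xmin, xmax)

# Y-axis limits matching the ggplot2 version
ax.set_ylim(19.9, 107)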

Temperature plot using Seaborn


Changing the geometry to heatmap

Preprocessing:

  • Compute mean by location & month
  • Replace month numbers with names

Mean temperature per month
location      month  mean
Death Valley  Jan    53.45161
Death Valley  Feb    59.94483
Death Valley  Mar    68.44839
Death Valley  Apr    76.29333
Death Valley  May    86.60645
Death Valley  Jun    95.54667
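
The two preprocessing steps can be sketched in pandas as follows. This is a minimal illustration, not the code used to produce the table: it uses only the six January rows from the daily temperature table, so the resulting mean naturally differs from the full-month value above, and the variable names `temps` and `mean_temps` are assumptions.

```python
import calendar

import pandas as pd

# First six rows of the daily temperature table (Death Valley, January).
temps = pd.DataFrame({
    "location": ["Death Valley"] * 6,
    "month": ["01"] * 6,
    "temperature": [51.0, 51.2, 51.3, 51.4, 51.6, 51.7],
})

# Step 1: compute the mean temperature per location and month.
mean_temps = (
    temps.groupby(["location", "month"], as_index=False)["temperature"]
    .mean()
    .rename(columns={"temperature": "mean"})
)

# Step 2: replace month numbers ("01", "02", ...) with abbreviated names.
mean_temps["month"] = (
    mean_temps["month"].astype(int).map(lambda m: calendar.month_abbr[m])
)

print(mean_temps)
```

For plotting, the month column would typically also be turned into an ordered categorical so that the months appear in calendar order rather than alphabetically.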

Aesthetics mapping and geometry

p <- ggplot(mean_temps, 
            aes(x = month, y = location, fill = mean)) + 
     geom_tile(width = .95, height = 0.95) +
     scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98,
                       name = "temperature (°F)") + 
     scale_y_discrete(name = NULL) +
     ...

3. Statistical transformations

Common statistical transformations


ggplot2 stat_ functions
Table adapted from Hadley Wickham (2016).

Name      Description
bin       Divide continuous range into bins, and count number of points in each
boxplot   Compute statistics necessary for boxplot
contour   Calculate contour lines
density   Compute 1d density estimate
identity  Identity transformation, f(x) = x
jitter    Jitter values by adding small random value
qq        Calculate values for quantile-quantile plot
quantile  Quantile regression
smooth    Smoothed conditional mean of y given x
summary   Aggregate values of y for given x
unique    Remove duplicated observations
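
As an illustration of what a statistical transformation does before anything is drawn, the `bin` row can be reproduced with NumPy. This is a sketch, not ggplot2's implementation: the six body-mass values come from the blue jay table shown later in this section, and the bin range and bin count are arbitrary assumptions.

```python
import numpy as np

# Body mass (g) of the six blue jays listed in the dataset excerpt.
mass = np.array([73.30, 75.10, 70.25, 65.50, 74.90, 63.90])

# Divide the range 60-78 g into three equal bins and count the points in each.
counts, edges = np.histogram(mass, bins=3, range=(60, 78))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} g: {n} birds")
```

A histogram geometry then only has to draw one bar per bin; the counting is entirely the job of the stat.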

Contours

Plot the relationship between body mass and head length in blue jays.


Blue jay dataset
BirdID      KnownSex  BillDepth  BillWidth  BillLength  Head   Mass   Skull  Sex
0000-00000  M         8.26       9.21       25.92       56.58  73.30  30.66  1
1142-05901  M         8.54       8.76       24.99       56.36  75.10  31.38  1
1142-05905  M         8.39       8.78       26.07       57.32  70.25  31.25  1
1142-05907  F         7.78       9.30       23.48       53.77  65.50  30.29  0
1142-05909  M         8.71       9.84       25.47       57.32  74.90  31.85  1
1142-05911  F         7.28       9.30       22.25       52.25  63.90  30.00  0

Contour plot, first version

blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) + 
  scale_x_continuous(limits = c(57, 82), expand = c(0, 0), name = "body mass (g)") +
  scale_y_continuous(limits = c(49, 61), expand = c(0, 0), name = "head length (mm)" ) +
  theme_dviz_grid()

blue_jays_base + 
  stat_density_2d(color = "black", linewidth = 0.4, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = 1/3)
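
The contour lines are level curves of an estimated two-dimensional density. The sketch below shows the idea with a hand-written Gaussian kernel density estimate in NumPy; it is not the estimator ggplot2 uses (its documentation refers to `MASS::kde2d()`), the function name `density_2d` is hypothetical, and the bandwidths and grid limits are arbitrary assumptions. The data are the six birds from the dataset excerpt.

```python
import numpy as np


def density_2d(x, y, grid_x, grid_y, bw_x, bw_y):
    """Axis-aligned Gaussian kernel density estimate evaluated on a grid.

    Returns an array of shape (len(grid_x), len(grid_y)).
    """
    zx = (grid_x[:, None] - x[None, :]) / bw_x
    zy = (grid_y[:, None] - y[None, :]) / bw_y
    kx = np.exp(-0.5 * zx**2) / (np.sqrt(2 * np.pi) * bw_x)
    ky = np.exp(-0.5 * zy**2) / (np.sqrt(2 * np.pi) * bw_y)
    # Average, over observations, the product of the two 1-D kernels.
    return kx @ ky.T / len(x)


mass = np.array([73.30, 75.10, 70.25, 65.50, 74.90, 63.90])  # body mass (g)
head = np.array([56.58, 56.36, 57.32, 53.77, 57.32, 52.25])  # head length (mm)

grid_mass = np.linspace(50, 90, 200)
grid_head = np.linspace(45, 65, 200)
density = density_2d(mass, head, grid_mass, grid_head, bw_x=2.0, bw_y=1.0)

# Sanity check: a density should integrate to roughly 1 over the grid.
cell_area = (grid_mass[1] - grid_mass[0]) * (grid_head[1] - grid_head[0])
total = density.sum() * cell_area
print(round(total, 3))
```

A contour geometry would then draw curves where this surface crosses a set of evenly spaced levels, which is the role the `binwidth` argument plays in the R code.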

Apply some shading

blue_jays_base + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", color = "black", linewidth = 0.15, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = .4) +
  scale_fill_gradient(low = "grey95", high = "grey70", guide = "none")

Grouping by sex

blue_jays_base + 
  aes(color = KnownSex) +
  stat_density_2d(linewidth = 0.4, binwidth = 0.006) +
  geom_point(size = 1.5, alpha = 0.7) +
  ...

Bins

Common applications:

  • Histograms
  • Contours
  • Heatmaps: aggregate values into grid cells to display intensity across two dimensions
  • Temporal aggregation
  • Large-data intensity approximation
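
For the heatmap case, binning in two dimensions amounts to accumulating a sum and a count per grid cell. The sketch below does this with `numpy.histogram2d`, using the ten rows of the mass spectrometry excerpt shown in the next subsection; the bin edges are arbitrary assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

# Ten rows from the mass spectrometry excerpt (Time, mz, intensity).
time = np.array([0.312346] * 5 + [691.492002] * 5)
mz = np.array([1548.487069, 99.330436, 979.736426, 1179.351225, 1231.517479,
               200.037732, 243.540075, 288.915998, 725.852351, 710.363398])
intensity = np.array([9, 9, 9, 9, 9, 86, 138, 62, 53, 99], dtype=float)

# Assumed bin edges: two mz bins (rows) and two time bins (columns).
mz_edges = np.array([0.0, 800.0, 1600.0])
time_edges = np.array([0.0, 350.0, 700.0])

# Sum of intensity and number of points per cell.
sums, _, _ = np.histogram2d(mz, time, bins=[mz_edges, time_edges], weights=intensity)
counts, _, _ = np.histogram2d(mz, time, bins=[mz_edges, time_edges])

# Mean intensity per cell; empty cells become NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    means = sums / counts

print(sums)
print(counts)
print(means)
```

Because sums and counts are additive, a table too large for memory could be processed in chunks, adding each chunk's `sums` and `counts` before dividing once at the end.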

Mass spectrometry

Prompt: Given a pandas DataFrame with more than 200 million rows and an 'mz' column with more than 26 million unique values, how can the table be aggregated so that we can create a heat map with 'mz' on the vertical axis, time on the horizontal axis, and intensity on the 'z' axis (color)?

Id    Time        scanid  index   intensity  mz
1     0.312346    35      376857  9          1548.487069
1     0.312346    38      2796    9          99.330436
1     0.312346    38      274380  9          979.736426
1     0.312346    38      313091  9          1179.351225
1     0.312346    39      322655  9          1231.517479
6481  691.492002  916     55969   86         200.037732
6481  691.492002  916     74584   138        243.540075
6481  691.492002  916     92301   62         288.915998
6481  691.492002  916     218889  53         725.852351
6481  691.492002  917     215210  99         710.363398

AI Response

To aggregate such a large Pandas DataFrame for creating a heatmap, you can use binning and grouping techniques to reduce the data into manageable chunks. Here’s a general approach to achieve this:


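import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Step 1: Downsample or aggregate the data if necessary to avoid memory issues
# (not shown). `super_df` is the full table; the bin counts below are
# illustrative assumptions, not values from the original response.
mz_bins = 500
time_bins = 500

# Step 2: Create a pivot table (heatmap structure) by aggregating intensity
heatmap_data = super_df.pivot_table(index=pd.cut(super_df['mz'], bins=mz_bins),
                                    columns=pd.cut(super_df['Time'], bins=time_bins),
                                    values='intensity',
                                    aggfunc='mean')  # You can change 'mean' to 'sum' if appropriate

# Step 3: Apply a logarithmic transformation so low-intensity values stay visible
# (log1p is an assumed choice; the original transformation is not shown)
heatmap_data_log = np.log1p(heatmap_data)

# Step 4: Plot the heatmap
plt.figure(figsize=(17, 10))  # Adjust figure size as needed
sns.heatmap(heatmap_data_log, cmap='magma', cbar_kws={'label': 'Log(Intensity)'})
# 'magma' gives more emphasis on high values

# Step 5: Show the plot
plt.show()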

Result

4. Position adjustment

Position Adjustments in ggplot2

Position  Description                                                         Commonly used with
identity  No adjustment; geoms are placed exactly where the data specify.     geom_point(), geom_bar()
stack     Stacks elements vertically along the y-axis.                        geom_bar(), geom_area()
fill      Like 'stack', but scales bars to show proportions (fills to 100%).  geom_bar(), geom_area()
dodge     Places overlapping objects side-by-side for comparison.             geom_bar(), geom_boxplot()
jitter    Adds small random variation to reduce overplotting.                 geom_point()
nudge     Moves text or labels slightly to improve readability.               geom_text(), geom_label()
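
Each adjustment in the table is a small arithmetic rule applied to coordinates before drawing. The NumPy sketch below works through four of them for three bars that share the same x position; the heights, the total bar width, and the jitter range are made-up values for illustration, and ggplot2's own defaults and stacking order may differ.

```python
import numpy as np

heights = np.array([3.0, 5.0, 2.0])  # three groups sharing the position x = 1
x = 1.0
n = len(heights)

# stack: each bar starts where the previous one ends.
stack_top = np.cumsum(heights)
stack_bottom = stack_top - heights

# fill: like stack, but rescaled so the whole column has height 1.
fill_top = stack_top / heights.sum()

# dodge: split an assumed total width among the groups, side by side.
total_width = 0.9
dodge_x = x + (np.arange(n) - (n - 1) / 2) * (total_width / n)

# jitter: add a small random horizontal offset (seeded for reproducibility).
rng = np.random.default_rng(0)
jitter_x = x + rng.uniform(-0.2, 0.2, size=n)

print(stack_bottom, stack_top)
print(fill_top)
print(dodge_x)
```

Stack and fill change the y coordinates, whereas dodge and jitter change x; this is why the first pair suits bars and areas and the second suits overlapping points or side-by-side comparisons.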

Position dodge

Using jitter to deal with occlusion

Partial transparency

Jitter